Predictive Capabilities of
LASSO Regression

Kayla Hayes, Kristy Tarano, HJ Kim, Andrew Melara-Suazo

2024-07-29

Introduction

  • Diagnosing genetic disorders has long been a significant challenge, requiring substantial time and resources, and is often hindered by inefficiencies in computerized decision-making systems in medical settings (Claussnitzer et al. 2020) (Y. Liu et al. 2023) (Tamburrano et al. 2020).
  • Machine learning has advanced the diagnosis of genetic disorders, with linear models effectively predicting disorder outcomes by targeting gene expression patterns (Raza et al. 2022) (S. Liu et al. 2019).
  • Logistic regression, optimized with LASSO (Least Absolute Shrinkage and Selection Operator) regression, simplifies models and improves the selection of independent variables, addressing issues like overfitting and multicollinearity and thereby enhancing predictive accuracy (Rusyana, Notodiputro, and Sartono 2021) (Ranstam and Cook 2018).
  • This study aims to use demographic and health indicators to predict disorder sub-classes by combining LASSO and multinomial logistic regression, focusing on key information to understand factors influencing diagnoses and improve diagnostic efficiency.

Methods

Multinomial Logistic Regression:

\[ \text{logit}(P_i) = \ln\left(\frac{P_i}{P_{\text{reference}}}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \]
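The log-odds equation above can be turned back into class probabilities by exponentiating and normalizing, with the reference class fixed at a log-odds of 0. The following is a minimal sketch in base R; the class names, coefficients, and predictor values are made up for illustration and are not taken from the study.

```r
# Hypothetical coefficients for a 3-class problem with 2 predictors.
# Rows: non-reference classes; columns: intercept, x1, x2.
beta <- rbind(
  class_B = c(0.5, 1.0, -0.5),
  class_C = c(-0.2, 0.3, 0.8)
)
x_new <- c(1, 2.0, 1.0)  # the leading 1 multiplies the intercept

# Log-odds of each non-reference class against the reference class A
log_odds <- as.vector(beta %*% x_new)
names(log_odds) <- rownames(beta)

# The reference class has log-odds 0 by construction
expo <- exp(c(class_A = 0, log_odds))
probs <- expo / sum(expo)

probs
sum(probs)  # the class probabilities sum to 1
```

Taking `log(probs["class_B"] / probs["class_A"])` recovers the linear predictor for class B, which is the relationship the equation states.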

LASSO Regression:

\[ \hat{\beta} = \arg\min_{\beta} \left( \frac{1}{2n} \sum_{i=1}^n (y_i - X_i \beta)^2 + \lambda \sum_{j=1}^p |\beta_j| \right) \]

L1 Penalization Portion:

\[ \lambda \sum_{j=1}^p |\beta_j| \]
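To make the role of the L1 penalty concrete, the sketch below evaluates the LASSO objective for a given coefficient vector and shows the soft-thresholding operator. Soft-thresholding is the closed-form LASSO solution only in the special case of orthonormal predictors; it is included here to illustrate why small coefficients are set exactly to zero. The function names and the simulated data are assumptions for illustration.

```r
# LASSO objective: (1 / 2n) * residual sum of squares + lambda * L1 norm
lasso_objective <- function(beta, X, y, lambda) {
  n <- nrow(X)
  rss <- sum((y - X %*% beta)^2)
  rss / (2 * n) + lambda * sum(abs(beta))
}

# Soft-thresholding: shrink towards zero and set small values exactly to zero
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

# Simulated data in which only the first predictor matters
set.seed(1)
X <- matrix(rnorm(40), nrow = 20, ncol = 2)
y <- X %*% c(2, 0) + rnorm(20, sd = 0.1)

lasso_objective(c(2, 0), X, y, lambda = 0.1)
soft_threshold(c(-1.5, 0.05, 0.8), lambda = 0.1)
```

A larger `lambda` widens the band of values that soft-thresholding maps to zero, which is how LASSO performs variable selection.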

K-Fold Cross Validation:

\[ \frac{1}{K} \sum_{k=1}^K \text{Err}_k \]
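The cross-validation estimate above is the average of the K held-out errors. A minimal base R sketch of that averaging is shown below, using simulated data and an ordinary linear model as a stand-in; the data, the model, and the choice of K = 5 are assumptions for illustration. `cv.glmnet` applies the same idea across a grid of lambda values, with 10 folds by default.

```r
set.seed(42)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

K <- 5
# Randomly assign each row to one of K folds of equal size
fold_id <- sample(rep(seq_len(K), length.out = n))

fold_err <- numeric(K)
for (k in seq_len(K)) {
  train <- dat[fold_id != k, ]
  held_out <- dat[fold_id == k, ]
  fit <- lm(y ~ x, data = train)
  pred <- predict(fit, newdata = held_out)
  fold_err[k] <- mean((held_out$y - pred)^2)  # Err_k: held-out mean squared error
}

cv_error <- mean(fold_err)  # (1 / K) * sum of Err_k
cv_error
```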

Analysis and Results

Data and Visualization

The data used in this study were sourced from a genomics dataset that records demographic and health indicators for a set of patients, together with each patient's genetic disorder and disorder subclass. The dataset contains 45 variables, a mix of categorical and continuous types. The data were originally provided as separate training and testing datasets, but only the training dataset was used in this study; it was split into its own training and testing sets as a proof of concept.

Variables in Original Dataset

# Packages Used

library(plyr)        # load before the tidyverse so dplyr's verbs are not masked
library(tidyverse)   # attaches ggplot2, dplyr, tidyr, purrr, and readr
library(knitr)
library(ggthemes)
library(ggrepel)
library(dslabs)
library(DT)
library(glmnet)
library(fastDummies)
library(nnet)
library(readxl)
#install.packages("naniar")
library(naniar)
#install.packages("VIM")   # VIM is called below as VIM::aggr(), so it must be installed
library(data.table)
#install.packages("mltools")
library(mltools)
library(reshape2)

# Load Data

Variables <- read.csv("Variables - Sheet1.csv", header = TRUE)

# Display the Variables table

datatable(Variables)

Target Data from Original Dataset

Target <- read.csv("Target - Sheet1.csv", header = TRUE)

# Display the Target Data table

datatable(Target)

Missing Values

During pre-processing, non-informative variables such as patient identifiers were removed from the dataset because they have no predictive value for the outcome. A multinomial logistic regression model was then developed to predict each patient's genetic disorder from the remaining recorded information.

Because missing values were minimal, rows with missing values were omitted rather than imputed, both to avoid the bias an imputation method could introduce and to provide the complete cases that multinomial logistic regression requires. The dataset was split into training and testing sets using an 80:20 ratio, and K-Fold Cross Validation was used to find the most appropriate lambda value for LASSO regression. Cross validation yielded a lambda value of 0.0002451, and a LASSO-penalized multinomial logistic regression was then fit at that value in R. The fitted model explained 99.9% of the deviance, which indicates a very close fit to the training data; the largest coefficients belong to the Disorder Subclass indicator variables. The final model had a reported accuracy of 0.494 on the testing data, computed from the confusion matrix as the proportion of all predictions that were correct.

df <- read_excel("train_genetics.xlsx")
# dim(df)
# glimpse(df)
df <- df[, !names(df) %in% c("Patient Id", "Patient First Name", "Family Name", "Father's name", "Institute Name", "Location of Institute", "Test 1", "Test 2", "Test 3", "Test 4", "Test 5", "Symptom 1", "Symptom 2", "Symptom 3", "Symptom 4", "Symptom 5", "Parental consent", "Follow-up", "H/O radiation exposure (x-ray)", "H/O substance abuse", "Birth asphyxia")]
# datatable(df)

# Missing value check

datatable(miss_var_summary(df))
VIM::aggr(df,prop=FALSE,numbers=TRUE)

#df <- df %>% filter(!is.na(df$`Mother's age`))
#df <- df %>% filter(!is.na(df$`Father's age`))
#VIM::aggr(df,prop=FALSE,numbers=TRUE)
#naniar::miss_var_summary(df)

dt <- na.omit(df)
dim(dt)
[1] 7176   24
sum(is.na(dt))
[1] 0
is.data.table(dt) # to see if data.table
[1] FALSE
dt <- as.data.table(dt)

# Convert Categorical data into numerical data

# Function to convert categorical variables to numeric
convert_to_numeric <- function(x) {
  if (is.numeric(x)) return(x)
  factor_x <- as.factor(x)
  as.numeric(factor_x)
}

# List of columns to convert (excluding already numeric columns and ID)

columns_to_convert <- c("Status", "Respiratory Rate (breaths/min)", "Heart Rate (rates/min", "Autopsy shows birth defect (if applicable)", "Place of birth", "Assisted conception IVF/ART", "History of anomalies in previous pregnancies","Birth defects", "Blood test result", "Genetic Disorder", "Disorder Subclass", "Gender")

# Convert specified columns to numeric
dt[, (columns_to_convert) := lapply(.SD, convert_to_numeric), .SDcols = columns_to_convert]

# "Respiratory Rate", "Heart Rate", and "Gender" are already integer codes after the
# conversion above, so they need no further handling.


# Convert Yes/No columns to 1/0
yes_no_columns <- c("Genes in mother's side", "Inherited from father", "Maternal gene", "Paternal gene", "Folic acid details (peri-conceptional)", "H/O serious maternal illness")

dt[, (yes_no_columns) := lapply(.SD, function(x) as.numeric(x == "Yes")), .SDcols = yes_no_columns]


VIM::aggr(dt,prop=FALSE,numbers=TRUE)

Cleaned Data

datatable(dt)

Histograms

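# EDA
# Histogram of every (now numeric) variable, one panel per variable

dt %>%
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins = 30)  # bins set explicitly to silence the default-bins message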

Correlation Heatmap

# creating correlation matrix
corr_mat <- round(cor(dt), 2)
# reshape2::melt is called explicitly because data.table also exports melt()
melted_corr_mat <- reshape2::melt(corr_mat)

# plotting the correlation heatmap
ggplot(data = melted_corr_mat, aes(x=Var1, y=Var2,
                                   fill=value)) + 
  geom_tile()

Statistical Analysis

# loading packages 
library(tidyverse)
library(knitr)
library(ggthemes)
library(ggrepel)
library(dslabs)
library(glmnet)
library(fastDummies)
library(nnet)

#Loading data 
genetic_data <- read_csv('train_genetics.csv')

#Removing NA from data
clean_gene_data <- na.omit(genetic_data)

#Creating dummy columns
clean_gene_data <- dummy_cols(clean_gene_data, select_columns = c("Respiratory Rate (breaths/min)","Gender" ,"Heart Rate (rates/min", "H/O radiation exposure (x-ray)", "Birth asphyxia", "H/O substance abuse", "Birth defects", "Blood test result", "Disorder Subclass"), remove_first_dummy = TRUE)

# for consistency
set.seed(234)

# Get row indices for the training set and setting split
train_indices <- sample(seq_len(nrow(clean_gene_data)), size = 0.8 * nrow(clean_gene_data))

# Split the data into 80:20
train_data <- clean_gene_data[train_indices, ] # should be about 80% of the original data
test_data <- clean_gene_data[-train_indices, ] # should be about 20% of the original data


#Defining the response variable
y <- as.factor(train_data$`Genetic Disorder`)


#Defining the matrix of predictor variables for the model
x <- data.matrix(train_data[,c ("Genes in mother's side",'Maternal gene','Paternal gene' ,'Inherited from father', 'Blood cell count (mcL)', 'Respiratory Rate (breaths/min)_Tachypnea', 'Heart Rate (rates/min_Tachycardia','Gender_Female', 'Gender_Male','Birth asphyxia_No record', 'Birth asphyxia_Not available','Birth asphyxia_Yes','Folic acid details (peri-conceptional)', 'H/O serious maternal illness', 'H/O radiation exposure (x-ray)_No', 'H/O radiation exposure (x-ray)_Not applicable', 'H/O radiation exposure (x-ray)_Yes', 'H/O substance abuse_No', 'H/O substance abuse_Not applicable', 'H/O substance abuse_Yes', 'Assisted conception IVF/ART', 'History of a0malies in previous pregnancies', 'Birth defects_Singular', 'Blood test result_inconclusive', 'Blood test result_normal', 'Blood test result_slightly abnormal','White Blood cell count (thousand per microliter)', 'Symptom 1', 'Symptom 2','Symptom 3','Symptom 4', 'Symptom 5', 'Test 1', 'Test 2','Test 3','Test 4', 'Test 5', 'Disorder Subclass_Cancer', 'Disorder Subclass_Cystic fibrosis', 'Disorder Subclass_Diabetes', 'Disorder Subclass_Hemochromatosis', "Disorder Subclass_Leber's hereditary optic neuropathy", 'Disorder Subclass_Leigh syndrome', 'Disorder Subclass_Mitochondrial myopathy', 'Disorder Subclass_Tay-Sachs')])


#Finding the optimal lambda value using k-fold cross validation (cv.glmnet defaults to 10 folds)
cross_val_model <- cv.glmnet(x, y, family = "multinomial", alpha= 1)

#Looking for the best lambda value 
min_lambda <- cross_val_model$lambda.min

#Value of Lambda
min_lambda
[1] 0.0002451145

#Plot of the cross-validated multinomial deviance against lambda (log scale)
plot(cross_val_model)
#Creating our LASSO regression model 
gene_model <- glmnet(x, y, family = "multinomial", alpha = 1, lambda = min_lambda)

#Coefficients of the model
coef(gene_model)
$`Mitochondrial genetic inheritance disorders`
46 x 1 sparse Matrix of class "dgCMatrix"
                                                             s0
                                                      -0.028792
Genes in mother's side                                 .       
Maternal gene                                          .       
Paternal gene                                          .       
Inherited from father                                  .       
Blood cell count (mcL)                                 .       
Respiratory Rate (breaths/min)_Tachypnea               .       
Heart Rate (rates/min_Tachycardia                      .       
Gender_Female                                          .       
Gender_Male                                            .       
Birth asphyxia_No record                               .       
Birth asphyxia_Not available                           .       
Birth asphyxia_Yes                                     .       
Folic acid details (peri-conceptional)                 .       
H/O serious maternal illness                           .       
H/O radiation exposure (x-ray)_No                      .       
H/O radiation exposure (x-ray)_Not applicable          .       
H/O radiation exposure (x-ray)_Yes                     .       
H/O substance abuse_No                                 .       
H/O substance abuse_Not applicable                     .       
H/O substance abuse_Yes                                .       
Assisted conception IVF/ART                            .       
History of a0malies in previous pregnancies            .       
Birth defects_Singular                                 .       
Blood test result_inconclusive                         .       
Blood test result_normal                               .       
Blood test result_slightly abnormal                    .       
White Blood cell count (thousand per microliter)       .       
Symptom 1                                              .       
Symptom 2                                              .       
Symptom 3                                              .       
Symptom 4                                              .       
Symptom 5                                              .       
Test 1                                                 .       
Test 2                                                 .       
Test 3                                                 .       
Test 4                                                 .       
Test 5                                                 .       
Disorder Subclass_Cancer                               .       
Disorder Subclass_Cystic fibrosis                      .       
Disorder Subclass_Diabetes                             .       
Disorder Subclass_Hemochromatosis                      .       
Disorder Subclass_Leber's hereditary optic neuropathy 10.148918
Disorder Subclass_Leigh syndrome                      10.132946
Disorder Subclass_Mitochondrial myopathy               9.690685
Disorder Subclass_Tay-Sachs                            .       

$`Multifactorial genetic inheritance disorders`
46 x 1 sparse Matrix of class "dgCMatrix"
                                                               s0
                                                       0.06377615
Genes in mother's side                                 0.04009867
Maternal gene                                          .         
Paternal gene                                          0.19224017
Inherited from father                                  0.11309765
Blood cell count (mcL)                                 .         
Respiratory Rate (breaths/min)_Tachypnea               .         
Heart Rate (rates/min_Tachycardia                      .         
Gender_Female                                          .         
Gender_Male                                            .         
Birth asphyxia_No record                               .         
Birth asphyxia_Not available                           .         
Birth asphyxia_Yes                                     .         
Folic acid details (peri-conceptional)                 .         
H/O serious maternal illness                           .         
H/O radiation exposure (x-ray)_No                      .         
H/O radiation exposure (x-ray)_Not applicable          .         
H/O radiation exposure (x-ray)_Yes                     .         
H/O substance abuse_No                                 .         
H/O substance abuse_Not applicable                     .         
H/O substance abuse_Yes                                .         
Assisted conception IVF/ART                            .         
History of a0malies in previous pregnancies            .         
Birth defects_Singular                                 .         
Blood test result_inconclusive                         .         
Blood test result_normal                               .         
Blood test result_slightly abnormal                    .         
White Blood cell count (thousand per microliter)       .         
Symptom 1                                              0.06942615
Symptom 2                                              0.47468900
Symptom 3                                              0.74021267
Symptom 4                                              0.96951046
Symptom 5                                              1.35989465
Test 1                                                 .         
Test 2                                                 .         
Test 3                                                 .         
Test 4                                                 .         
Test 5                                                 .         
Disorder Subclass_Cancer                               5.98053346
Disorder Subclass_Cystic fibrosis                      .         
Disorder Subclass_Diabetes                             4.81155092
Disorder Subclass_Hemochromatosis                      .         
Disorder Subclass_Leber's hereditary optic neuropathy -0.14134838
Disorder Subclass_Leigh syndrome                       .         
Disorder Subclass_Mitochondrial myopathy               .         
Disorder Subclass_Tay-Sachs                            .         

$`Single-gene inheritance diseases`
46 x 1 sparse Matrix of class "dgCMatrix"
                                                               s0
                                                      -0.03498415
Genes in mother's side                                 .         
Maternal gene                                          .         
Paternal gene                                          .         
Inherited from father                                  .         
Blood cell count (mcL)                                 .         
Respiratory Rate (breaths/min)_Tachypnea               .         
Heart Rate (rates/min_Tachycardia                      .         
Gender_Female                                          .         
Gender_Male                                            .         
Birth asphyxia_No record                               .         
Birth asphyxia_Not available                           .         
Birth asphyxia_Yes                                     .         
Folic acid details (peri-conceptional)                 .         
H/O serious maternal illness                           .         
H/O radiation exposure (x-ray)_No                      .         
H/O radiation exposure (x-ray)_Not applicable          .         
H/O radiation exposure (x-ray)_Yes                     .         
H/O substance abuse_No                                 .         
H/O substance abuse_Not applicable                     .         
H/O substance abuse_Yes                                .         
Assisted conception IVF/ART                            .         
History of a0malies in previous pregnancies            .         
Birth defects_Singular                                 .         
Blood test result_inconclusive                         .         
Blood test result_normal                               .         
Blood test result_slightly abnormal                    .         
White Blood cell count (thousand per microliter)       .         
Symptom 1                                              .         
Symptom 2                                              .         
Symptom 3                                              .         
Symptom 4                                              .         
Symptom 5                                              .         
Test 1                                                 .         
Test 2                                                 .         
Test 3                                                 .         
Test 4                                                 .         
Test 5                                                 .         
Disorder Subclass_Cancer                               .         
Disorder Subclass_Cystic fibrosis                     10.62654612
Disorder Subclass_Diabetes                             .         
Disorder Subclass_Hemochromatosis                      8.30586094
Disorder Subclass_Leber's hereditary optic neuropathy  .         
Disorder Subclass_Leigh syndrome                       .         
Disorder Subclass_Mitochondrial myopathy               .         
Disorder Subclass_Tay-Sachs                            9.02598863
print(gene_model)

Call:  glmnet(x = x, y = y, family = "multinomial", alpha = 1, lambda = min_lambda) 

  Df %Dev    Lambda
1 16 99.9 0.0002451
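The `%Dev` column printed by glmnet is the percentage of the null deviance that the model explains, that is, 100 * (1 - deviance / null deviance). The sketch below computes the same ratio for an ordinary logistic regression fit with base R's `glm` on simulated data; the data and variable names are assumptions for illustration. Values close to 100% on training data mean the predictors almost fully determine the response in that data, so held-out performance should be checked separately.

```r
set.seed(7)
n <- 200
x1 <- rnorm(n)
y_sim <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * x1))

fit <- glm(y_sim ~ x1, family = binomial)

# Fraction of the null deviance explained by the model
dev_ratio <- 1 - fit$deviance / fit$null.deviance
round(100 * dev_ratio, 1)
```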
#Creating matrix of predictor variables from test data
x_test <- data.matrix(test_data[,c ("Genes in mother's side",'Maternal gene','Paternal gene' ,'Inherited from father', 'Blood cell count (mcL)', 'Respiratory Rate (breaths/min)_Tachypnea', 'Heart Rate (rates/min_Tachycardia','Gender_Female', 'Gender_Male','Birth asphyxia_No record', 'Birth asphyxia_Not available','Birth asphyxia_Yes','Folic acid details (peri-conceptional)', 'H/O serious maternal illness', 'H/O radiation exposure (x-ray)_No', 'H/O radiation exposure (x-ray)_Not applicable', 'H/O radiation exposure (x-ray)_Yes', 'H/O substance abuse_No', 'H/O substance abuse_Not applicable', 'H/O substance abuse_Yes', 'Assisted conception IVF/ART', 'History of a0malies in previous pregnancies', 'Birth defects_Singular', 'Blood test result_inconclusive', 'Blood test result_normal', 'Blood test result_slightly abnormal','White Blood cell count (thousand per microliter)', 'Symptom 1', 'Symptom 2','Symptom 3','Symptom 4', 'Symptom 5', 'Test 1', 'Test 2','Test 3','Test 4', 'Test 5', 'Disorder Subclass_Cancer', 'Disorder Subclass_Cystic fibrosis', 'Disorder Subclass_Diabetes', 'Disorder Subclass_Hemochromatosis', "Disorder Subclass_Leber's hereditary optic neuropathy", 'Disorder Subclass_Leigh syndrome', 'Disorder Subclass_Mitochondrial myopathy', 'Disorder Subclass_Tay-Sachs')])

#Defining response variable from test data 
y_test <- as.factor(test_data$`Genetic Disorder`)



#CONFUSION MATRIX BELOW 

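#Predicting classes for the test data; type = "class" returns class labels, not probabilities
pred_prob <- predict(gene_model, newx = x_test, s = min_lambda, type = "class")

#Thresholding the predictions at 0.05; because pred_prob holds class labels, this is a
#string comparison that is TRUE for every row here, so pred_class is 1 throughout
pred_class <- ifelse(pred_prob > 0.05, 1, 0)

#Creating the confusion matrix (a single "Predicted" row, since pred_class only takes the value 1)
confusion_matrix <- table(Predicted = pred_class, Actual = y_test)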

print(confusion_matrix)
         Actual
Predicted Mitochondrial genetic inheritance disorders
        1                                         663
         Actual
Predicted Multifactorial genetic inheritance disorders
        1                                          135
         Actual
Predicted Single-gene inheritance diseases
        1                              544
#Accuracy: diagonal of the confusion matrix over the sum of all predictions
#(for the 1 x 3 table above, diag() returns only its first cell, giving 663 / 1342)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

# Print the accuracy decimal
print(paste("Accuracy:", round(accuracy, 4)))
[1] "Accuracy: 0.494"
#Precision per class: true positives over all predictions of that class (row sums)
precision <- diag(confusion_matrix) / rowSums(confusion_matrix)

#Recall per class: true positives over all actual members of that class (column sums)
recall <- diag(confusion_matrix) / colSums(confusion_matrix)

#F1 score: harmonic mean of precision and recall, so it reflects both false positives and false negatives
f1_score <- 2 * (precision * recall) / (precision + recall)
 

f1_score
 Mitochondrial genetic inheritance disorders 
                                   0.6613466 
Multifactorial genetic inheritance disorders 
                                   0.8977657 
            Single-gene inheritance diseases 
                                   0.7030753 
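Because `predict(..., type = "class")` already returns class labels, a multiclass confusion matrix can be built by tabulating predicted labels against actual labels, with no probability threshold. The sketch below shows this on a handful of made-up labels (not the study's predictions); the shortened class names are assumptions for illustration.

```r
classes <- c("Mitochondrial", "Multifactorial", "Single-gene")

# Made-up labels for illustration only
actual <- factor(c("Mitochondrial", "Mitochondrial", "Mitochondrial",
                   "Multifactorial", "Multifactorial",
                   "Single-gene", "Single-gene", "Single-gene"),
                 levels = classes)
predicted <- factor(c("Mitochondrial", "Mitochondrial", "Single-gene",
                      "Multifactorial", "Mitochondrial",
                      "Single-gene", "Single-gene", "Multifactorial"),
                    levels = classes)

# Sharing factor levels keeps the table square even if a class is never predicted
cm <- table(Predicted = predicted, Actual = actual)

accuracy  <- sum(diag(cm)) / sum(cm)      # correct predictions over all predictions
precision <- diag(cm) / rowSums(cm)       # per class, over predicted counts
recall    <- diag(cm) / colSums(cm)       # per class, over actual counts
f1        <- 2 * precision * recall / (precision + recall)

cm
accuracy
f1
```

With a square table, `diag()` picks out the correctly classified count for every class, so accuracy, precision, and recall all stay between 0 and 1.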

Conclusion

Performance

Accuracy: 0.494, Kappa: 1 → accuracy about the same as a coin toss
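Cohen's kappa compares the observed accuracy with the agreement expected by chance from the class frequencies: a value of 0 corresponds to chance-level agreement and a value of 1 to perfect agreement. The sketch below computes it from a square confusion matrix in base R; the function name `cohen_kappa` and both example matrices are hypothetical.

```r
cohen_kappa <- function(cm) {
  n <- sum(cm)
  p_observed <- sum(diag(cm)) / n
  # Agreement expected by chance, from the row and column marginals
  p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2
  (p_observed - p_expected) / (1 - p_expected)
}

# Hypothetical 3 x 3 confusion matrices (counts are made up)
perfect <- diag(c(10, 10, 10))              # every prediction correct
chance  <- matrix(10, nrow = 3, ncol = 3)   # predictions unrelated to the truth

cohen_kappa(perfect)  # 1
cohen_kappa(chance)   # 0
```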

Effective variables: genes in mother’s side, paternal gene, inherited from father    

Following the Process

Pre-processing → EDA → Analysis → Interpretation

  1. Delete rows with missing values so that all columns have equal length → Check data types through visualization

  2. Finding Lambda value → Model setup → LASSO regression → Confusion Matrix

Lessons Learned

Did we extract and use the proper variables?

The parameters used in the model followed the standard procedure → feature selection needs improvement

What if only variables of the same type were used to build a model, e.g., binary, categorical, and continuous variables each modeled separately?

Using only the variables chosen through feature selection might have improved performance

References

Claussnitzer, Melina, Judy H Cho, Rory Collins, Nancy J Cox, Emmanouil T Dermitzakis, Matthew E Hurles, Sekar Kathiresan, et al. 2020. “A Brief History of Human Disease Genetics.” Nature 577 (7789): 179–89.
Liu, Shuai, Mengye Lu, Hanshuang Li, and Yongchun Zuo. 2019. “Prediction of Gene Expression Patterns with Generalized Linear Regression Model.” Frontiers in Genetics 10: 120.
Liu, Yanqiu, Liangwei Mao, Hui Huang, Wei Li, Jianfen Man, Wenqian Zhang, Lina Wang, et al. 2023. “Clinical Diagnosis of Genetic Disorders at Both Single-Nucleotide and Chromosomal Levels Based on BGISEQ-500 Platform.” Human Genome Variation 10 (1): 15.
Ranstam, Jonas, and Jonathan A Cook. 2018. “LASSO Regression.” Journal of British Surgery 105 (10): 1348–48.
Raza, Ali, Furqan Rustam, Hafeez Ur Rehman Siddiqui, Isabel de la Torre Diez, Begoña Garcia-Zapirain, Ernesto Lee, and Imran Ashraf. 2022. “Predicting Genetic Disorder and Types of Disorder Using Chain Classifier Approach.” Genes 14 (1): 71.
Rusyana, A, KA Notodiputro, and B Sartono. 2021. “The Lasso Binary Logistic Regression Method for Selecting Variables That Affect the Recovery of Covid-19 Patients in China.” In Journal of Physics: Conference Series, 1882:012035. 1. IOP Publishing.
Tamburrano, Andrea, Doriana Vallone, Cinzia Carrozza, Andrea Urbani, Maurizio Sanguinetti, Nicola Nicolotti, Andrea Cambieri, and Patrizia Laurenti. 2020. “Evaluation and Cost Estimation of Laboratory Test Overuse in 43 Commonly Ordered Parameters Through a Computerized Clinical Decision Support System (CCDSS) in a Large University Hospital.” PLoS One 15 (8): e0237159.